{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# LoC Data Package Tutorial: General Collections Assessment\n", "\n", "This notebook demonstrates basic use of Python for interacting with [data packages from the Library of Congress](https://data.labs.loc.gov/packages/) via the [General Collections Assessment Data Package](https://data.labs.loc.gov/gen-coll-assessment/), which is derived from an ongoing program to assess the Library's approximately 22 million books, bound serials, and other materials classified under the [General Collections](https://www.loc.gov/rr/coll-general.html). We will:\n", "\n", "1. [Read and query metadata from a data package](#Query-the-metadata-in-a-data-package)\n", "2. [Visualize the data](#Visualize-the-data)\n", "\n", "## Prerequisites\n", "\n", "To run this notebook, follow the instructions listed in [this directory's README](https://github.com/LibraryOfCongress/data-exploration/blob/master/Data%20Packages/README.md)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Query the metadata in a data package\n", "\n", "First, we will download the Children's Literature assessment metadata file, print a summary of the items' location values, and then filter by a particular location." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded metadata file with 331,145 entries.\n" ] } ], "source": [ "import io\n", "\n", "import pandas as pd # for reading, manipulating, and displaying data\n", "import requests\n", "\n", "DATA_URL = 'https://data.labs.loc.gov/gen-coll-assessment/'\n", "\n", "metadata_url = f'{DATA_URL}chi.csv' # Also try 'localhistory_us.csv' or 'philosphy.csv'\n", "response = requests.get(metadata_url, timeout=60)\n", "response.raise_for_status() # stop here if the download failed\n", "metadata_string = response.text\n", "\n", "# Read the contents as a CSV file, keeping every value as a string\n", "df = pd.read_csv(io.StringIO(metadata_string), dtype=str, keep_default_na=False)\n", "print(f'Loaded metadata file with {df.shape[0]:,} entries.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's print the available properties." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lccn, Title, Author, Publisher, Language, Begin Publication Date, Format, Country, Display Call Number, Holdings Location Display Name, Holdings Location, Begin Publication Date (Decade), Region, Subclass, US NonUS, Isbn, Issn, Language Code, Place Code, Place of Publication, State or Country\n" ] } ], "source": [ "print(', '.join(df.columns.to_list()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, print the 10 most frequent locations in this dataset." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | State or Country | \n", "
---|---|
New York (State) | \n", "164607 | \n", "
Massachusetts | \n", "22867 | \n", "
Illinois | \n", "12702 | \n", "
Minnesota | \n", "11068 | \n", "
California | \n", "11065 | \n", "
England | \n", "11012 | \n", "
Pennsylvania | \n", "9800 | \n", "
No place, unknown, or undetermined | \n", "8803 | \n", "
Germany | \n", "4611 | \n", "
Wisconsin | \n", "4341 | \n", "